Given the rapidly growing amount of literature on COVID-19, it is difficult to keep up with the major research trends being explored on this topic. Can we cluster similar research articles to make it easier for health professionals to find relevant research trends?
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
Cite: COVID-19 Open Research Dataset Challenge (CORD-19) | Kaggle
The data loading procedure described below is adapted from the Kaggle notebook "COVID EDA: Initial Exploration Tool" by Ivan Ega Pratama (Kaggle dataset parsing code). Before running the procedures below, make sure to unzip the file CORD-19-research-challenge.zip inside the Experiment3 folder.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import json
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Let's load the metadata of the articles. The 'title' and 'journal' attributes may be useful later, when we cluster the articles, to see what kinds of articles cluster together.
root_path = 'CORD-19-research-challenge/'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
'pubmed_id': str,
'Microsoft Academic Paper ID': str,
'doi': str
})
meta_df.head()
| cord_uid | sha | source_x | title | doi | pmcid | pubmed_id | license | abstract | publish_time | authors | journal | Microsoft Academic Paper ID | WHO #Covidence | has_pdf_parse | has_pmc_xml_parse | full_text_file | url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xqhn0vbp | 1e1286db212100993d03cc22374b624f7caee956 | PMC | Airborne rhinovirus detection and effect of ul... | 10.1186/1471-2458-3-5 | PMC140314 | 12525263 | no-cc | BACKGROUND: Rhinovirus, the most common cause ... | 2003-01-13 | Myatt, Theodore A; Johnston, Sebastian L; Rudn... | BMC Public Health | NaN | NaN | True | True | custom_license | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1... |
| 1 | gi6uaa83 | 8ae137c8da1607b3a8e4c946c07ca8bda67f88ac | PMC | Discovering human history from stomach bacteria | 10.1186/gb-2003-4-5-213 | PMC156578 | 12734001 | no-cc | Recent analyses of human pathogens have reveal... | 2003-04-28 | Disotell, Todd R | Genome Biol | NaN | NaN | True | True | custom_license | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1... |
| 2 | le0ogx1s | NaN | PMC | A new recruit for the army of the men of death | 10.1186/gb-2003-4-7-113 | PMC193621 | 12844350 | no-cc | The army of the men of death, in John Bunyan's... | 2003-06-27 | Petsko, Gregory A | Genome Biol | NaN | NaN | False | True | custom_license | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1... |
| 3 | fy4w7xz8 | 0104f6ceccf92ae8567a0102f89cbb976969a774 | PMC | Association of HLA class I with severe acute r... | 10.1186/1471-2350-4-9 | PMC212558 | 12969506 | no-cc | BACKGROUND: The human leukocyte antigen (HLA) ... | 2003-09-12 | Lin, Marie; Tseng, Hsiang-Kuang; Trejaut, Jean... | BMC Med Genet | NaN | NaN | True | True | custom_license | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2... |
| 4 | 0qaoam29 | 5b68a553a7cbbea13472721cd1ad617d42b40c26 | PMC | A double epidemic model for the SARS propagation | 10.1186/1471-2334-3-19 | PMC222908 | 12964944 | no-cc | BACKGROUND: An epidemic of a Severe Acute Resp... | 2003-09-10 | Ng, Tuen Wai; Turinici, Gabriel; Danchin, Antoine | BMC Infect Dis | NaN | NaN | True | True | custom_license | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2... |
meta_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51078 entries, 0 to 51077
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   cord_uid                     51078 non-null  object
 1   sha                          38022 non-null  object
 2   source_x                     51078 non-null  object
 3   title                        50920 non-null  object
 4   doi                          47741 non-null  object
 5   pmcid                        41082 non-null  object
 6   pubmed_id                    37861 non-null  object
 7   license                      51078 non-null  object
 8   abstract                     42352 non-null  object
 9   publish_time                 51070 non-null  object
 10  authors                      48891 non-null  object
 11  journal                      46368 non-null  object
 12  Microsoft Academic Paper ID  964 non-null    object
 13  WHO #Covidence               1768 non-null   object
 14  has_pdf_parse                51078 non-null  bool
 15  has_pmc_xml_parse            51078 non-null  bool
 16  full_text_file               42511 non-null  object
 17  url                          50776 non-null  object
dtypes: bool(2), object(16)
memory usage: 6.3+ MB
Next, we will get the paths to all JSON files:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)
59311
We need the following helper functions for reading the JSON files and for inserting line breaks into long strings (used later for plotting).
class FileReader:
def __init__(self, file_path):
with open(file_path) as file:
content = json.load(file)
self.paper_id = content['paper_id']
self.abstract = []
self.body_text = []
# Abstract
for entry in content['abstract']:
self.abstract.append(entry['text'])
# Body text
for entry in content['body_text']:
self.body_text.append(entry['text'])
self.abstract = '\n'.join(self.abstract)
self.body_text = '\n'.join(self.body_text)
def __repr__(self):
return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
# Helper function that adds a <br> break after a word once the accumulated character count exceeds a given length. This is for the interactive plot, so that the hover tooltip fits the screen.
def get_breaks(content, length):
data = ""
words = content.split(' ')
total_chars = 0
# add break every length characters
for i in range(len(words)):
total_chars += len(words[i])
if total_chars > length:
data = data + "<br>" + words[i]
total_chars = 0
else:
data = data + " " + words[i]
return data
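As a quick illustration of the second helper, here is a self-contained copy of `get_breaks` applied to a toy string: a `<br>` is inserted once the running character count exceeds the budget (note that the result keeps a leading space, since the first word is always appended after `" "`):

```python
# Self-contained copy of the notebook's get_breaks helper, for illustration only.
def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            # character budget exceeded: break before this word and reset the counter
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data

print(get_breaks("aaaa bbbb cccc", 5))  # → " aaaa<br>bbbb cccc"
```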
Using these helper functions, let's read the articles into a DataFrame that can be used easily:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
try:
if idx % (len(all_json) // 10) == 0:
print(f'Processing index: {idx} of {len(all_json)}')
content = FileReader(entry)
# get metadata information
meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
# no metadata, skip this paper
if len(meta_data) == 0:
continue
dict_['paper_id'].append(content.paper_id)
dict_['abstract'].append(content.abstract)
dict_['body_text'].append(content.body_text)
# also create a column for the summary of abstract to be used in a plot
if len(content.abstract) == 0:
# no abstract provided
dict_['abstract_summary'].append("Not provided.")
elif len(content.abstract.split(' ')) > 100:
# abstract provided is too long for plot; take the first 100 words and append '...'
info = content.abstract.split(' ')[:100]
summary = get_breaks(' '.join(info), 40)
dict_['abstract_summary'].append(summary + "...")
else:
# abstract is short enough
summary = get_breaks(content.abstract, 40)
dict_['abstract_summary'].append(summary)
try:
# if more than one author
authors = meta_data['authors'].values[0].split(';')
if len(authors) > 2:
# more than 2 authors, which may be a problem when plotting, so take the first 2 and append '...'
dict_['authors'].append(". ".join(authors[:2]) + "...")
else:
# authors will fit in plot
dict_['authors'].append(". ".join(authors))
except Exception as e:
# if only one author - or Null value
dict_['authors'].append(meta_data['authors'].values[0])
# add the title information, add breaks when needed
try:
title = get_breaks(meta_data['title'].values[0], 40)
dict_['title'].append(title)
# if title was not provided
except Exception as e:
dict_['title'].append(meta_data['title'].values[0])
# add the journal information
dict_['journal'].append(meta_data['journal'].values[0])
except Exception as e:
continue
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()
Processing index: 0 of 59311
Processing index: 5931 of 59311
Processing index: 11862 of 59311
Processing index: 17793 of 59311
Processing index: 23724 of 59311
Processing index: 29655 of 59311
Processing index: 35586 of 59311
Processing index: 41517 of 59311
Processing index: 47448 of 59311
Processing index: 53379 of 59311
Processing index: 59310 of 59311
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | |
|---|---|---|---|---|---|---|---|
| 0 | 0015023cc06b5362d332b3baf348d11567ca2fbb | word count: 194 22 Text word count: 5168 23 24... | VP3, and VP0 (which is further processed to VP... | Joseph C. Ward. Lidia Lasecka-Dykes... | The RNA pseudoknots in foot-and-mouth disease... | NaN | word count: 194 22 Text word count: 5168 23 2... |
| 1 | 00340eea543336d54adda18236424de6a5e91c9d | During the past three months, a new coronaviru... | In December 2019, a novel coronavirus, SARS-Co... | Carla Mavian. Simone Marini... | Regaining perspective on SARS-CoV-2<br>molecu... | NaN | During the past three months, a new coronavir... |
| 2 | 004f0f8bb66cf446678dc13cf2701feec4f36d76 | The 2019-nCoV epidemic has spread across China... | Hanchu Zhou. Jianan Yang... | Healthcare-resource-adjusted<br>vulnerabiliti... | NaN | Not provided. | |
| 3 | 00911cf4f99a3d5ae5e5b787675646a743574496 | The fast accumulation of viral metagenomic dat... | Metagenomic sequencing, which allows us to dir... | Jiayu Shang. Yanni Sun | CHEER: hierarCHical taxonomic<br>classificati... | NaN | The fast accumulation of viral metagenomic<br... |
| 4 | 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b | Infectious bronchitis (IB) causes significant ... | Infectious bronchitis (IB), which is caused by... | Salman L. Butt. Eric C. Erwood... | Real-time, MinION-based, amplicon<br>sequenci... | NaN | Infectious bronchitis (IB) causes<br>signific... |
dict_ = None
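The per-article metadata lookup in the loop above is just a filter of `meta_df` on the `sha` column. A toy sketch of that pattern and of the author-shortening rule, on a hypothetical miniature metadata frame (the `sha` values and names are made up):

```python
import pandas as pd

# Hypothetical miniature metadata frame; the real CORD-19 metadata has many more columns.
meta = pd.DataFrame({
    'sha':     ['aaa111', 'bbb222'],
    'authors': ['Doe, Jane; Roe, Richard; Poe, Edgar', 'Solo, Ann'],
    'title':   ['A toy paper', 'Another toy paper'],
})

paper_id = 'aaa111'                            # would come from the parsed JSON
meta_data = meta.loc[meta['sha'] == paper_id]  # match JSON paper_id against metadata sha
assert len(meta_data) == 1                     # no match would mean: skip this paper

# Same rule as the notebook: keep the first 2 authors, append '...' if there are more
authors = meta_data['authors'].values[0].split(';')
display_authors = ". ".join(authors[:2]) + "..." if len(authors) > 2 else ". ".join(authors)
print(display_authors)
```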
We add two extra columns related to the word count of the abstract and body_text, which can be useful features later:
df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))
df_covid.head()
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0015023cc06b5362d332b3baf348d11567ca2fbb | word count: 194 22 Text word count: 5168 23 24... | VP3, and VP0 (which is further processed to VP... | Joseph C. Ward. Lidia Lasecka-Dykes... | The RNA pseudoknots in foot-and-mouth disease... | NaN | word count: 194 22 Text word count: 5168 23 2... | 241 | 1728 |
| 1 | 00340eea543336d54adda18236424de6a5e91c9d | During the past three months, a new coronaviru... | In December 2019, a novel coronavirus, SARS-Co... | Carla Mavian. Simone Marini... | Regaining perspective on SARS-CoV-2<br>molecu... | NaN | During the past three months, a new coronavir... | 175 | 2549 |
| 2 | 004f0f8bb66cf446678dc13cf2701feec4f36d76 | The 2019-nCoV epidemic has spread across China... | Hanchu Zhou. Jianan Yang... | Healthcare-resource-adjusted<br>vulnerabiliti... | NaN | Not provided. | 0 | 755 | |
| 3 | 00911cf4f99a3d5ae5e5b787675646a743574496 | The fast accumulation of viral metagenomic dat... | Metagenomic sequencing, which allows us to dir... | Jiayu Shang. Yanni Sun | CHEER: hierarCHical taxonomic<br>classificati... | NaN | The fast accumulation of viral metagenomic<br... | 139 | 5188 |
| 4 | 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b | Infectious bronchitis (IB) causes significant ... | Infectious bronchitis (IB), which is caused by... | Salman L. Butt. Eric C. Erwood... | Real-time, MinION-based, amplicon<br>sequenci... | NaN | Infectious bronchitis (IB) causes<br>signific... | 1647 | 4003 |
df_covid.describe(include='all')
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| count | 36009 | 36009 | 36009 | 35413 | 35973 | 34277 | 36009 | 36009.000000 | 36009.000000 |
| unique | 36009 | 26249 | 35981 | 33538 | 35652 | 5410 | 26239 | NaN | NaN |
| top | 0015023cc06b5362d332b3baf348d11567ca2fbb | In previous reports, workers have characterize... | Domingo, Esteban | In the Literature | PLoS One | Not provided. | NaN | NaN | |
| freq | 1 | 9704 | 3 | 14 | 9 | 1518 | 9704 | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 160.511678 | 4705.127163 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 168.348075 | 6944.838042 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 2370.000000 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 158.000000 | 3645.000000 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 235.000000 | 5450.000000 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4767.000000 | 260378.000000 |
Looking at the unique values above, we can see that there are duplicates. This may be because some authors submitted the same article to multiple journals. Let's remove the duplicates from our dataset, along with those articles that are missing abstracts or body text.
df_covid.dropna(inplace=True)
df_covid = df_covid[df_covid.abstract != ''] #Remove rows which are missing abstracts
df_covid = df_covid[df_covid.body_text != ''] #Remove rows which are missing body_text
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True) # remove duplicate rows having same abstract and body_text
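The cleaning steps above can be illustrated on a hypothetical toy frame (all values made up): `dropna` removes rows with missing fields, the empty-string filter removes missing texts, and `drop_duplicates` keeps one copy of each (abstract, body_text) pair:

```python
import pandas as pd

toy = pd.DataFrame({
    'abstract':  ['same abstract', 'same abstract', '',     'unique abstract'],
    'body_text': ['same body',     'same body',     'text', 'other body'],
    'journal':   ['J1',            'J2',            'J3',   None],
})

toy = toy.dropna()                                    # row 3 goes: missing journal
toy = toy[toy.abstract != '']                         # row 2 goes: empty abstract
toy = toy.drop_duplicates(['abstract', 'body_text'])  # row 1 goes: duplicate of row 0
print(len(toy))  # → 1
```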
df_covid.describe(include='all')
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| count | 24584 | 24584 | 24584 | 24584 | 24584 | 24584 | 24584 | 24584.000000 | 24584.000000 |
| unique | 24584 | 24552 | 24584 | 23709 | 24545 | 3963 | 24545 | NaN | NaN |
| top | 00142f93c18b07350be89e96372d240372437ed9 | Travel Medicine and Infectious Disease xxx (xx... | iNTRODUCTiON Human beings are constantly expos... | Woo, Patrick C. Y.. Lau, Susanna K. P.... | Respiratory Infections | PLoS One | Travel Medicine and Infectious Disease xxx<br... | NaN | NaN |
| freq | 1 | 5 | 1 | 7 | 3 | 1514 | 5 | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 216.446673 | 4435.475106 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 137.065117 | 3657.421423 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 23.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 147.000000 | 2711.000000 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 200.000000 | 3809.500000 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 255.000000 | 5431.000000 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3694.000000 | 232431.000000 |
df_covid.head()
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| 1625 | 00142f93c18b07350be89e96372d240372437ed9 | Dendritic cells (DCs) are specialized antigen-... | iNTRODUCTiON Human beings are constantly expos... | Geginat, Jens. Nizzoli, Giulia... | Immunity to Pathogens Taught by Specialized<b... | Front Immunol | Dendritic cells (DCs) are specialized<br>anti... | 309 | 5305 |
| 1626 | 0022796bb2112abd2e6423ba2d57751db06049fb | Dengue has a negative impact in low-and lower ... | Pathogens and vectors can now be transported r... | Viennet, Elvina. Ritchie, Scott A.... | Public Health Responses to and Challenges for... | PLoS Negl Trop Dis | Dengue has a negative impact in low-and lower... | 276 | 7288 |
| 1627 | 0031e47b76374e05a18c266bd1a1140e5eacb54f | Fecal microbial transplantation (FMT), a treat... | a1111111111 a1111111111 a1111111111 a111111111... | McKinney, Caroline A.. Oliveira, Bruno C. M.... | The fecal microbiota of healthy donor horses<... | PLoS One | Fecal microbial transplantation (FMT), a<br>t... | 141 | 4669 |
| 1628 | 00326efcca0852dc6e39dc6b7786267e1bc4f194 | Fifteen years ago, United Nations world leader... | In addition to preventative care and nutrition... | Turner, Erin L.. Nielsen, Katie R.... | A Review of Pediatric Critical Care in<br>Res... | Front Pediatr | Fifteen years ago, United Nations world<br>le... | 151 | 7593 |
| 1629 | 00352a58c8766861effed18a4b079d1683fec2ec | Posttranslational modification of proteins by ... | Ubiquitination is a widely used posttranslatio... | Hodul, Molly. Dahlberg, Caroline L.... | Function of the Deubiquitinating Enzyme USP46... | Front Synaptic Neurosci | Posttranslational modification of proteins<br... | 148 | 3156 |
df_covid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24584 entries, 1625 to 36008
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   paper_id             24584 non-null  object
 1   abstract             24584 non-null  object
 2   body_text            24584 non-null  object
 3   authors              24584 non-null  object
 4   title                24584 non-null  object
 5   journal              24584 non-null  object
 6   abstract_summary     24584 non-null  object
 7   abstract_word_count  24584 non-null  int64
 8   body_word_count      24584 non-null  int64
dtypes: int64(2), object(7)
memory usage: 1.9+ MB
Let us limit the number of articles to speed up computation:
df_covid = df_covid.head(12500)
Now let's remove punctuation from each text:
import re
df_covid['body_text'] = df_covid['body_text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
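A quick sanity check of the cleaning step. One pitfall worth noting: the character-class range `A-z` (lowercase z) also matches the ASCII characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_`, backtick), which is why `A-Z` is the safe spelling:

```python
import re

raw = "COVID-19: spike (S) protein_binding [review]"
# keep only letters, digits, and whitespace
clean = re.sub(r'[^a-zA-Z0-9\s]', '', raw)
print(clean)  # → COVID19 spike S proteinbinding review
```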
Convert each text to lower case:
def lower_case(input_str):
input_str = input_str.lower()
return input_str
df_covid['body_text'] = df_covid['body_text'].apply(lambda x: lower_case(x))
df_covid['abstract'] = df_covid['abstract'].apply(lambda x: lower_case(x))
df_covid.head(4)
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| 1625 | 00142f93c18b07350be89e96372d240372437ed9 | dendritic cells dcs are specialized antigenpre... | introduction human beings are constantly expos... | Geginat, Jens. Nizzoli, Giulia... | Immunity to Pathogens Taught by Specialized<b... | Front Immunol | Dendritic cells (DCs) are specialized<br>anti... | 309 | 5305 |
| 1626 | 0022796bb2112abd2e6423ba2d57751db06049fb | dengue has a negative impact in lowand lower m... | pathogens and vectors can now be transported r... | Viennet, Elvina. Ritchie, Scott A.... | Public Health Responses to and Challenges for... | PLoS Negl Trop Dis | Dengue has a negative impact in low-and lower... | 276 | 7288 |
| 1627 | 0031e47b76374e05a18c266bd1a1140e5eacb54f | fecal microbial transplantation fmt a treatmen... | a1111111111 a1111111111 a1111111111 a111111111... | McKinney, Caroline A.. Oliveira, Bruno C. M.... | The fecal microbiota of healthy donor horses<... | PLoS One | Fecal microbial transplantation (FMT), a<br>t... | 141 | 4669 |
| 1628 | 00326efcca0852dc6e39dc6b7786267e1bc4f194 | fifteen years ago united nations world leaders... | in addition to preventative care and nutrition... | Turner, Erin L.. Nielsen, Katie R.... | A Review of Pediatric Critical Care in<br>Res... | Front Pediatr | Fifteen years ago, United Nations world<br>le... | 151 | 7593 |
Now that we have the text cleaned up, we can create our features vector which can be fed into a clustering or dimensionality reduction algorithm. For our first try, we will focus on the text on the body of the articles. Let's grab that:
text = df_covid.drop(["paper_id", "abstract", "abstract_word_count", "body_word_count", "authors", "title", "journal", "abstract_summary"], axis=1)
text.head(5)
| body_text | |
|---|---|
| 1625 | introduction human beings are constantly expos... |
| 1626 | pathogens and vectors can now be transported r... |
| 1627 | a1111111111 a1111111111 a1111111111 a111111111... |
| 1628 | in addition to preventative care and nutrition... |
| 1629 | ubiquitination is a widely used posttranslatio... |
Let's transform this single-column DataFrame into a list where each element is the body text of one article (instance), so that we can work with the words of each instance:
text_arr = text.stack().tolist()
len(text_arr)
12500
Next, let's create a 2D list, where each row is an instance and each column is a word. That is, we will split each instance into words:
words = []
for ii in range(0,len(text)):
words.append(str(text.iloc[ii]['body_text']).split(" "))
print(words[0][:20])
['introduction', 'human', 'beings', 'are', 'constantly', 'exposed', 'to', 'a', 'myriad', 'of', 'pathogens', 'including', 'bacteria', 'fungi', 'and', 'viruses', 'these', 'foreign', 'invaders', 'or']
What we want now is to create n-grams from the words with n=2 (i.e., 2-grams). A 2-gram is a sequence of two words appearing together (e.g., 'thank you'). The motivation behind using 2-grams is to describe a document using pairs of consecutive words, instead of individual words, so as to capture the co-occurrence information of words at adjacent positions in a document. We have created a 2D list where each row is an instance (or document) and every column is a word. We need to create a similar 2D list where every row is an instance and every column is a 2-gram.
n_gram_all = []
for word in words:
# get n-grams for the instance
n_gram = []
for i in range(len(word)-2+1):
n_gram.append("".join(word[i:i+2]))
n_gram_all.append(n_gram)
n_gram_all[0][:10]
['introductionhuman', 'humanbeings', 'beingsare', 'areconstantly', 'constantlyexposed', 'exposedto', 'toa', 'amyriad', 'myriadof', 'ofpathogens']
M = len(n_gram_all)  # number of instances (articles)
N = len(n_gram_all[1])  # 2-gram count of one (arbitrary) instance; rows vary in length
print("{} X {}".format(M, N))
12500 X 7228
# Answer 2
words_n_gram_all = []
new_words = [['the', '2019', 'novel', 'coronavirus', 'sarscov2', 'identified', 'as', 'the', 'cause']]
for word in new_words:
# get n-grams for the instance
n_gram = []
for i in range(len(word)-2+1):
n_gram.append("".join(word[i:i+2]))
words_n_gram_all.append(n_gram)
words_n_gram_all[0][:]
['the2019', '2019novel', 'novelcoronavirus', 'coronavirussarscov2', 'sarscov2identified', 'identifiedas', 'asthe', 'thecause']
To vectorize the set of n-gram features constructed for every document, we will use scikit-learn's built-in HashingVectorizer. We will limit the feature size to 2^12 (4096) to speed up computations; we may need to increase this later to improve accuracy:
from sklearn.feature_extraction.text import HashingVectorizer
# hash vectorizer instance
hvec = HashingVectorizer(lowercase=False, analyzer=lambda l:l, n_features=2**12)
# features matrix X
X = hvec.fit_transform(n_gram_all)
X.shape
(12500, 4096)
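For intuition about what HashingVectorizer does: instead of storing a vocabulary, it hashes each token to a column index modulo `n_features`, so memory stays fixed no matter how many distinct 2-grams appear. A minimal stdlib-only sketch of that idea (scikit-learn's actual implementation uses MurmurHash3 and an alternating sign to reduce collision bias, so this is an illustration, not its exact behavior):

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Toy hashing-trick vectorizer: map each token to a bucket via a stable hash."""
    vec = [0] * n_features
    for tok in tokens:
        # deterministic hash -> column index in [0, n_features)
        h = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16)
        vec[h % n_features] += 1
    return vec

doc = ['humanbeings', 'beingsare', 'areconstantly', 'humanbeings']
v = hashed_counts(doc)
print(sum(v), max(v))  # 4 tokens in total; the repeated 2-gram piles up in one bucket
```

Collisions (two distinct tokens sharing a bucket) are the price of the fixed feature size, which is why increasing `n_features` can improve accuracy.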
We first reduce the dimensionality of X from a 4096-dimensional space to a 2-dimensional space using t-SNE, a popular non-linear dimensionality reduction technique. In this process, t-SNE keeps similar instances close together while pushing dissimilar instances apart. The resulting 2-D scatter plot of the t-SNE features can show which articles fall near each other.
# Following cell may take 20-30 minutes to run
from sklearn.manifold import TSNE
tsne = TSNE(verbose=1, perplexity=5)
X_embedded = tsne.fit_transform(X.toarray())
[t-SNE] Computing 16 nearest neighbors... [t-SNE] Indexed 12500 samples in 0.075s... [t-SNE] Computed neighbors for 12500 samples in 18.400s... [t-SNE] Computed conditional probabilities for sample 1000 / 12500 [t-SNE] Computed conditional probabilities for sample 2000 / 12500 [t-SNE] Computed conditional probabilities for sample 3000 / 12500 [t-SNE] Computed conditional probabilities for sample 4000 / 12500 [t-SNE] Computed conditional probabilities for sample 5000 / 12500 [t-SNE] Computed conditional probabilities for sample 6000 / 12500 [t-SNE] Computed conditional probabilities for sample 7000 / 12500 [t-SNE] Computed conditional probabilities for sample 8000 / 12500 [t-SNE] Computed conditional probabilities for sample 9000 / 12500 [t-SNE] Computed conditional probabilities for sample 10000 / 12500 [t-SNE] Computed conditional probabilities for sample 11000 / 12500 [t-SNE] Computed conditional probabilities for sample 12000 / 12500 [t-SNE] Computed conditional probabilities for sample 12500 / 12500 [t-SNE] Mean sigma: 0.128966 [t-SNE] KL divergence after 250 iterations with early exaggeration: 149.291901 [t-SNE] KL divergence after 1000 iterations: 4.166944
Let's plot the result:
from matplotlib import pyplot as plt
import seaborn as sns
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# plot (single hue, so no palette needed)
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1])
plt.title("t-SNE Covid-19 Articles")
# plt.savefig("plots/t-sne_covid19.png")
plt.show()
We can see a few clusters forming. However, without labels it is difficult to tell them apart. Let's see whether K-means can generate labels for these clusters; we can later use those labels to produce a labeled scatterplot that verifies the clusters.
Let us apply K-means with K = 10. For computational efficiency, we will use the mini-batch version of K-means (MiniBatchKMeans).
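For intuition, mini-batch K-means repeats three steps: sample a small batch, assign each batch point to its nearest center, then nudge each center toward its batch members with a step size that shrinks as the center accumulates points. A rough numpy sketch of that idea on made-up toy blobs (a simplified illustration, not scikit-learn's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs around (0, 0) and (5, 5)
X_toy = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
                   rng.normal(5, 0.3, size=(100, 2))])

k, batch_size = 2, 32
# Deterministic init for the demo: one seed point from each blob
centers = np.array([X_toy[0], X_toy[100]])
counts = np.zeros(k)

for _ in range(50):
    batch = X_toy[rng.choice(len(X_toy), batch_size, replace=False)]
    # assign each batch point to its nearest center
    dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    # move each center toward its batch members with a per-center decaying step size
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = (1 - eta) * centers[j] + eta * x

print(np.round(np.sort(centers[:, 0]), 1))  # x-coordinates end up near 0 and 5
```

Because each update touches only a small batch, the cost per iteration is independent of the dataset size, which is what makes this variant attractive for 12,500 documents.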
from sklearn.cluster import MiniBatchKMeans
k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
Now that we have the labels, let's plot the t-SNE scatterplot again and see if K-means is able to capture the pattern of clusters in the data:
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y_pred)))
# plot
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered")
# plt.savefig("plots/t-sne_covid19_label.png")
plt.show()
This looks pretty promising. We can see that articles from the same cluster lie near each other, forming groups. There are still overlaps, though, so we should see whether we can improve this by changing the number of clusters (K), using another clustering algorithm, or using a different feature size in the HashingVectorizer. We could also try 3-grams, 4-grams, or 1-grams (plain text) instead of 2-grams to create the features, and vectorize them with a different document vectorization method, e.g., HashingVectorizer or TfidfVectorizer.
# Answer 3
k = 18
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y_pred)))
# plot
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered")
plt.savefig("t-sne_covid19_label.png")
plt.show()
Let's see whether we can get better clusters by using plain text as instances rather than 2-grams, vectorized with tf-idf.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=2**12)
X = vectorizer.fit_transform(df_covid['body_text'].values)
X.shape
(12500, 4096)
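For intuition about why tf-idf can help here: terms that appear in nearly every article (e.g., "virus") get an inverse document frequency near zero and stop dominating the feature vectors. A small sketch using the textbook formula, tf-idf = (term frequency) × log(N / document frequency); note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its exact numbers differ:

```python
import math

# Three toy "documents" as token lists (contents made up for illustration)
docs = [
    ['virus', 'spike', 'protein'],
    ['virus', 'vaccine', 'trial'],
    ['virus', 'spike', 'mutation'],
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)     # term frequency in this document
    df = sum(term in d for d in docs)   # number of documents containing the term
    return tf * math.log(N / df)

print(tfidf('virus', docs[0]))    # 0.0 -- appears in every document, so it carries no weight
print(tfidf('vaccine', docs[1]))  # rarer term, so it gets a higher weight than 'spike'
```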
Let's get our cluster labels again, choosing 10 clusters as before.
from sklearn.cluster import MiniBatchKMeans
k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
Get the labels:
y = y_pred
Let's reduce the dimensionality using t-SNE again:
# Following cell will take 20-30 minutes to run
from sklearn.manifold import TSNE
tsne = TSNE(verbose=1)
X_embedded = tsne.fit_transform(X.toarray())
[t-SNE] Computing 91 nearest neighbors... [t-SNE] Indexed 12500 samples in 0.132s... [t-SNE] Computed neighbors for 12500 samples in 17.368s... [t-SNE] Computed conditional probabilities for sample 1000 / 12500 [t-SNE] Computed conditional probabilities for sample 2000 / 12500 [t-SNE] Computed conditional probabilities for sample 3000 / 12500 [t-SNE] Computed conditional probabilities for sample 4000 / 12500 [t-SNE] Computed conditional probabilities for sample 5000 / 12500 [t-SNE] Computed conditional probabilities for sample 6000 / 12500 [t-SNE] Computed conditional probabilities for sample 7000 / 12500 [t-SNE] Computed conditional probabilities for sample 8000 / 12500 [t-SNE] Computed conditional probabilities for sample 9000 / 12500 [t-SNE] Computed conditional probabilities for sample 10000 / 12500 [t-SNE] Computed conditional probabilities for sample 11000 / 12500 [t-SNE] Computed conditional probabilities for sample 12000 / 12500 [t-SNE] Computed conditional probabilities for sample 12500 / 12500 [t-SNE] Mean sigma: 0.206152 [t-SNE] KL divergence after 250 iterations with early exaggeration: 96.178764 [t-SNE] KL divergence after 1000 iterations: 2.357430
from matplotlib import pyplot as plt
import seaborn as sns
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y)))
# plot
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered(K-Means) - Tf-idf with Plain Text")
# plt.savefig("plots/t-sne_covid19_label_TFID.png")
plt.show()
This time we can see the clusters more clearly, as they are further apart from each other. We can also begin to see that there are possibly more than 10 clusters to identify with K-means.
# Answer 4
# Vectorize the 2-gram representation of the documents using tf-idf, instead of plain text
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda l:l, max_features=2**12)
X = vectorizer.fit_transform(n_gram_all)
# apply k means
from sklearn.cluster import MiniBatchKMeans
k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y1 = kmeans.fit_predict(X)
# tsne
from sklearn.manifold import TSNE
tsne = TSNE(verbose=1)
X_embedded = tsne.fit_transform(X.toarray())
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y1)))
# plot
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y1, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered (K-Means) - Tf-idf with 2-grams")
# plt.savefig("plots/t-sne_covid19_label_TFID.png")
plt.show()
[t-SNE verbose log: 12500 samples indexed; mean sigma: 0.265925; KL divergence after 250 iterations with early exaggeration: 122.639458; after 1000 iterations: 3.281467]
t-SNE doesn't scale well, which is why this notebook takes roughly 40 minutes to an hour to run on an average computer. Let's see whether we can achieve reasonable results with PCA, which scales much better to larger datasets and higher dimensions:
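One caveat with scikit-learn's `PCA` is that it requires a dense matrix, which is why `X.toarray()` appears in the code below; on very large corpora that densification alone can exhaust memory. `TruncatedSVD` performs a comparable linear reduction directly on the sparse tf-idf matrix. A sketch on a random sparse stand-in matrix (the shapes here are illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# a sparse stand-in for the tf-idf matrix X (1000 docs x 4096 features)
X_sparse = sparse_random(1000, 4096, density=0.01, format="csr", random_state=0)

# TruncatedSVD works on the sparse matrix directly - no .toarray() needed
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(X_sparse)

print(reduced.shape)  # (1000, 3)
print(svd.explained_variance_ratio_.sum())
```

TruncatedSVD applied to tf-idf matrices is also known as latent semantic analysis (LSA), so it is a natural fit for document data like ours.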
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(X.toarray())
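Before reading too much into the PCA plot, it is worth checking how much variance the three components actually retain; for a 4096-feature tf-idf matrix this fraction is often small, so the projection should be read as a rough map rather than a faithful one. A sketch on synthetic data (the `X_demo` name is a stand-in, not a variable from this notebook):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 50))

pca_demo = PCA(n_components=3)
pca_demo.fit(X_demo)

# fraction of total variance captured by each of the 3 components
print(pca_demo.explained_variance_ratio_)
print(pca_demo.explained_variance_ratio_.sum())
```

The same check on the real matrix is just `pca.explained_variance_ratio_.sum()` after the fit above.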
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y1)))
# plot
sns.scatterplot(x=pca_result[:,0], y=pca_result[:,1], hue=y1, legend='full', palette=palette)
plt.title("PCA Covid-19 Articles - Clustered (K-Means) - Tf-idf with 2-grams")
# plt.savefig("plots/pca_covid19_label_TFID.png")
plt.show()
Sometimes it is easier to see the results in a three-dimensional plot, so let's try that:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(projection='3d')
ax.scatter(
xs=pca_result[:,0],
ys=pca_result[:,1],
zs=pca_result[:,2],
c=y1,
cmap='tab10'
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.title("PCA Covid-19 Articles (3D) - Clustered (K-Means) - Tf-idf with 2-grams")
# plt.savefig("plots/pca_covid19_label_TFID_3d.png")
plt.show()
In our previous plot we could see that there are more than 10 clusters. Let's try to label them with k = 20:
from sklearn.cluster import MiniBatchKMeans
k = 20
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
y = y_pred
from matplotlib import pyplot as plt
import seaborn as sns
import random
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# shuffle the palette so that neighbouring cluster labels get visually distinct colors
palette = sns.hls_palette(20, l=.4, s=.9)
random.shuffle(palette)
# plot
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered (K-Means, k=20) - Tf-idf with 2-grams")
# plt.savefig("plots/t-sne_covid19_20label_TFID.png")
plt.show()
It would be helpful to have a demo tool for exploring which articles our clustering and dimensionality reduction methods identify as similar. Let's put together an interactive t-SNE scatter plot to do that.
from bokeh.models import (ColumnDataSource, HoverTool, LinearColorMapper, CustomJS,
                          RadioButtonGroup, TextInput, Div, Paragraph)
from bokeh.palettes import Category20
from bokeh.transform import linear_cmap, transform
from bokeh.io import output_file, output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import column, gridplot, widgetbox
output_notebook()
y_labels = y_pred
# data sources
source = ColumnDataSource(data=dict(
x= X_embedded[:,0],
y= X_embedded[:,1],
x_backup = X_embedded[:,0],
y_backup = X_embedded[:,1],
desc= y_labels,
titles= df_covid['title'],
authors = df_covid['authors'],
journal = df_covid['journal'],
abstract = df_covid['abstract_summary'],
labels = ["C-" + str(x) for x in y_labels]
))
# hover over information
hover = HoverTool(tooltips=[
("Title", "@titles{safe}"),
("Author(s)", "@authors"),
("Journal", "@journal"),
("Abstract", "@abstract{safe}"),
],
point_policy="follow_mouse")
# map colors
mapper = linear_cmap(field_name='desc',
palette=Category20[20],
low=min(y_labels) ,high=max(y_labels))
# prepare the figure
p = figure(plot_width=800, plot_height=800,
tools=[hover, 'pan', 'wheel_zoom', 'box_zoom', 'reset'],
title="t-SNE Covid-19 Articles, Clustered(K-Means), Tf-idf with Plain Text",
toolbar_location="right")
# plot
p.scatter('x', 'y', size=5,
source=source,
fill_color=mapper,
line_alpha=0.3,
line_color="black",
legend_field='labels')
# add callback to control
callback = CustomJS(args=dict(p=p, source=source), code="""
var radio_value = cb_obj.active;
var data = source.data;
x = data['x'];
y = data['y'];
x_backup = data['x_backup'];
y_backup = data['y_backup'];
labels = data['desc'];
if (radio_value == '20') {
for (i = 0; i < x.length; i++) {
x[i] = x_backup[i];
y[i] = y_backup[i];
}
}
else {
for (i = 0; i < x.length; i++) {
if(labels[i] == radio_value) {
x[i] = x_backup[i];
y[i] = y_backup[i];
} else {
x[i] = undefined;
y[i] = undefined;
}
}
}
source.change.emit();
""")
# callback for searchbar
keyword_callback = CustomJS(args=dict(p=p, source=source), code="""
var text_value = cb_obj.value;
var data = source.data;
x = data['x'];
y = data['y'];
x_backup = data['x_backup'];
y_backup = data['y_backup'];
abstract = data['abstract'];
titles = data['titles'];
authors = data['authors'];
journal = data['journal'];
for (i = 0; i < x.length; i++) {
if(abstract[i].includes(text_value) ||
titles[i].includes(text_value) ||
authors[i].includes(text_value) ||
journal[i].includes(text_value)) {
x[i] = x_backup[i];
y[i] = y_backup[i];
} else {
x[i] = undefined;
y[i] = undefined;
}
}
source.change.emit();
""")
# option
option = RadioButtonGroup(labels=["C-0", "C-1", "C-2",
"C-3", "C-4", "C-5",
"C-6", "C-7", "C-8",
"C-9", "C-10", "C-11",
"C-12", "C-13", "C-14",
"C-15", "C-16", "C-17",
"C-18", "C-19", "All"],
active=20)
option.js_on_change('active', callback)
# search box
keyword = TextInput(title="Search:")
keyword.js_on_change('value', keyword_callback)
#header
header = Div(text="""<h1>COVID-19 Literature Cluster</h1>""")
# show
show(column(header, option, keyword, p))